Little Knowledge Rules the Web: Domain-Centric Result Page Extraction

نویسندگان

  • Tim Furche
  • Georg Gottlob
  • Giovanni Grasso
  • Giorgio Orsi
  • Christian Schallhart
  • Cheng Wang
چکیده

Web extraction is the task of turning unstructured HTML into structured data. Previous approaches rely exclusively on detecting repeated structures in result pages. These approaches trade intensive user interaction for precision. In this paper, we introduce the Amber (“Adaptable Model-based Extraction of Result Pages”) system that replaces the human interaction with a domain ontology applicable to all sites of a domain. It models domain knowledge about (1) records and attributes of the domain, (2) low-level (textual) representations of these concepts, and (3) constraints linking representations to records and attributes. Parametrized with these constraints, otherwise domain-independent heuristics exploit the repeated structure of result pages to derive attributes and records. Amber is implemented in logical rules to allow an explicit formulation of the heuristics and easy adaptation to different domains. We apply Amber to the UK real estate domain where we achieve near perfect accuracy on a representative sample of 50 agency websites.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

S-CREAM: Semiautomatic CREAtion of Metadata

Richly interlinked, machine-understandable data constitute the basis for the Semantic Web. We provide a framework, SCREAM, that allows for creation of metadata and is trainable for a specific domain. Annotating web documents is one of the major techniques for creating metadata on the web. The implementation of S-CREAM, OntoMat supports now the semi-automatic annotation of web pages. This semi-a...

متن کامل

Grammatical inference for information extraction and visualisation on the Web

The world-wide web contains a wealth of database-style information scattered across different sites that could be better used if it were integrated into a single view. Since document formats vary widely between sites and frequently mingle structural with presentation markup, extracting and integrating data from web pages is a difficult challenge. Manually writing extraction wrappers is expensiv...

متن کامل

Knowledge Extraction by Using an Ontology Based Annotation Tool

This paper describes a Semantic Annotation Tool for extraction of knowledge structures from web pages through the use of simple user-defined knowledge extraction patterns. The semantic annotation tool contains: an ontology-based mark-up component which allows the user to browse and to mark-up relevant pieces of information; a learning component (Crystal from the University of Massachusetts at A...

متن کامل

Building Intelligent Systems for Mining Information Extraction Rules from Web Pages by Using Domain Knowledge

Previous researches on automatic information extraction experienced difficulties in acquiring and representing useful domain knowledge and in coping with the structural heterogeneity among different information sources. As a result, many real-world information sources with complex document structures could not be correctly analyzed. In order to resolve these problems, this paper presents a meth...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011